0. Kaggle Description

0.1: Introduction

Kobe Bryant marked his retirement from the NBA by scoring 60 points in his final game as a Los Angeles Laker on Wednesday, April 12, 2016. Drafted into the NBA at the age of 17, Kobe earned the sport’s highest accolades throughout his long career.

Using 20 years of data on Kobe's swishes and misses, can you predict which shots will find the bottom of the net? This competition is well suited for practicing classification basics, feature engineering, and time series analysis. Practice got Kobe an eight-figure contract and 5 championship rings. What will it get you?

Acknowledgements

Kaggle is hosting this competition for the data science community to use for fun and education. For more data on Kobe and other NBA greats, visit stats.nba.com.

0.2: The Data and Task

This dataset contains the location and circumstances of every field goal Kobe Bryant attempted during his 20-year career. Your task is to predict whether the basket went in (shot_made_flag).

We have removed 5000 of the shot_made_flags (represented as missing values in the csv file). These are the test set shots for which you must submit a prediction. You are provided a sample submission file with the correct shot_ids needed for a valid prediction.

To avoid leakage, your method should only train on events that occurred prior to the shot for which you are predicting! Since this is a playground competition with public answers, it's up to you to abide by this rule.

0.3: Submission Type

For each missing shot_made_flag in the data set, you should predict a probability that Kobe made the field goal. The file should have a header and the following format:

shot_id,shot_made_flag

1,0.5

8,0.5

17,0.5

Submissions are evaluated using the log loss cost function.
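As a sketch of the metric, here is a minimal hand-rolled version (in practice `sklearn.metrics.log_loss` is the usual choice):

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss: -mean(y*log(p) + (1-y)*log(1-p))."""
    p = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    y = np.asarray(y_true, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# The constant 0.5 guess from the sample rows above scores -log(0.5) ≈ 0.693;
# confident correct predictions score close to 0.
print(log_loss([1, 0, 1], [0.5, 0.5, 0.5]))
```

Log loss heavily penalizes confident wrong answers, which is why the all-0.5 submission is the no-information baseline.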

1: Overview

1.1: Let's start with a quick overview of the data

From the looks of the table above, we have some locational data, a fair number of categorical features, and several id-based features, which are unlikely to be used later on. We also have a good number of total observations (not big data, but small-data constraints are unlikely).

1.2: How about missing values?

We have 5000 missing values total; as stated in the Kaggle description, this is our test set, so we'll be sure to extract them.

In other words, we will not have any missing values once the split occurs.
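A minimal sketch of that split, assuming the csv has been loaded into a pandas DataFrame (the rows here are hypothetical stand-ins):

```python
import pandas as pd
import numpy as np

# Tiny stand-in for the real csv.
df = pd.DataFrame({
    "shot_id": [1, 2, 3, 4],
    "shot_made_flag": [np.nan, 1.0, 0.0, np.nan],
})

# Rows with a missing shot_made_flag are the Kaggle test set;
# everything else is available for training.
test = df[df["shot_made_flag"].isna()].copy()
train = df[df["shot_made_flag"].notna()].copy()
print(len(train), len(test))
```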

1.3: Misleading Values

We're looking at the numeric features to find any potential outliers. This also gives us a good look at them in general. Most don't provide much relevant information; the id-based features pollute this section, and there are a few binary features that should be treated as categorical.

An interesting result to note: the shot_distance attribute has a minimum value of 0. This could realistically happen (dunks and the like), but I want to make sure those zeros aren't stand-ins for missing values.

1.4: Physical Distribution of Shots

This is the career of a professional basketball player visualized.

Purely based on observation, it seems that Kobe attempted the majority of his shots from within the three-point line. It also looks like there is no clear relationship between his successful shots and where he was along the extended free throw line (i.e., his location along the y-axis did not seem to matter).

1.5: The Target Feature

Ignoring the observations in the test set, it seems that Kobe missed slightly more shots than he made, with a field goal percentage of around 44.6%. As expected, the target feature calls for binary classification, and there is no need for any form of transformation.

2: Splitting the Data

2.1: Separate Testing Set Based on Aforementioned Kaggle Requirements

Now that we have our training and testing sets, we can safely perform some EDA.

3: Exploratory Data Analysis (EDA)

3.1: Shot Distance

Let's look at my prior concerns about potential irregularities in the shot_distance feature.

Remember that the lowest value was zero, which I worried might be a stand-in for a missing value. That said, it is more likely that these values represent dunks made by Kobe. We'll make sure.

It's clear that most of the shots occurred at the net. These include dunks, layups, and other close shots. Let's get a little bit of accounting done.

After looking at the types of shots that occurred at the net (i.e. with a distance of 0), you can tell that those values are genuine rather than missing. As surmised, layups and dunks account for the vast majority of the close shots, so they are definitely not stand-ins for missing values.
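The accounting can be sketched like this, on hypothetical rows shaped like the real columns:

```python
import pandas as pd

# Hypothetical rows mimicking the dataset's columns.
shots = pd.DataFrame({
    "shot_distance": [0, 0, 0, 24, 15],
    "combined_shot_type": ["Dunk", "Layup", "Layup", "Jump Shot", "Jump Shot"],
})

# Count the shot types among zero-distance attempts.
at_rim = shots.loc[shots["shot_distance"] == 0, "combined_shot_type"].value_counts()
print(at_rim)
```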

That being said, how does the shot's distance relate to its scoring chance?

Knowing even a little bit about basketball, you could likely guess that a shot's distance influenced whether or not it would be a field goal or miss. This boxplot just confirms that relationship. Shots made closer to the net are far more likely to succeed than those attempted from farther away.

In addition, according to the data, shots attempted from more than 45 feet never succeeded.

3.2: Action Type vs Combined Shot Type

Next up we have two categorical features; at first blush, action type and combined shot type seem very similar, since both essentially describe the type of shot Kobe performed. They overlap heavily, so we need to see their relationship as well as how they collectively correlate with shot percentage. What I ended up finding is that both have problems, which are showcased below:

3.2.1: Problems with Action Type

Let's look at both:

From this point onwards, the distribution has an ever-longer tail until we reach categories that consist of a single observation. At the end of the day, this is a heavily skewed distribution.

The problem with this attribute is that it has many sparsely populated categories, each of which will end up becoming a dimension when modeling. This would cripple performance and harm the predictive power of many models. As such, adding it as-is would be problematic.

3.2.2: Problems with Combined Shot Types

As the name suggests, this attribute combines the categories of the prior feature.

Although it has fewer categories, it suffers from aggregating too much, as we'll see below.

Above you see two different styles of layup on the left and the generalized version on the right. The Driving Layup Shot and Normal Layup Shot are each heavily weighted towards either missing or scoring. The aggregated layup is less polarized. We lose information by replacing the two on the left with the one on the right.

3.2.3: Solution

We need to make our own version of the Combined Shot Type feature that avoids being too generalized or having that long-tail problem.

Step 1: Identify which action types are mapped to which combined shot types

Step 2: Find action types within a combined category holding more than 5% of the total observations.

Step 3: Pull them out as their own label and leave the rest as an 'other' category that contains rarer labels.
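The three steps above can be sketched as follows (column names are from the competition csv; the toy rows and the `shot_type_clean` name are my own):

```python
import pandas as pd

def recombine_shot_types(df, threshold=0.05):
    """Steps 1-2: find action types holding more than `threshold` of their
    combined category's observations; Step 3: fold the rest into 'other'."""
    share = (df.groupby("combined_shot_type")["action_type"]
               .value_counts(normalize=True))
    keep = {action for (_, action), frac in share.items() if frac > threshold}
    out = df.copy()
    out["shot_type_clean"] = out["action_type"].where(
        out["action_type"].isin(keep),
        out["combined_shot_type"] + " - other")
    return out

# Hypothetical rows: one action type dominates, one is rare (3% < 5%).
toy = pd.DataFrame({
    "combined_shot_type": ["Layup"] * 100,
    "action_type": ["Driving Layup Shot"] * 97 + ["Reverse Layup Shot"] * 3,
})
print(recombine_shot_types(toy)["shot_type_clean"].unique())
```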

Now we have a nice middle ground. There are 12 labels in this new feature rather than almost 70, and each of these labels provides a close look at its observations.

3.3: Shot Type

I'm bringing back our prior image to showcase this feature.

Although there is a clear connection between the type of shot made and its chance of becoming a field goal, there is also a relationship between it and the shot distance showcased earlier. The two features are highly correlated, so this one will be dropped later on.

3.4: Shot Zone Area

Although this feature suffers from the same potential for covariance with shot distance that shot type did, this variable differentiates itself by incorporating the shot angle into the equation. That aspect is not naturally within this dataset.

For example, it's clear that Kobe managed greater success when shooting from the center rather than the left or right, despite the fact that they (mostly) were taken from comparable ranges.

We end up creating a shot angle feature later on, which covers the information relayed by this attribute, so we will keep this away from the model.

4: Feature Engineering and Selection

Although we've run through and selected (or disqualified) a few features, those decisions were pretty straightforward, and they were direct byproducts of my exploratory data analysis. In this next section, we will be looking through other attributes that either need to be altered or have a more complicated relationship with others in the set.

4.1: Season and Game Date

Here we have two features that represent a similar aspect of Kobe's performance, specifically when in his career a shot was taken. Season represents this as a categorical feature, while game date does so more continuously.

Before investigating though, we'll be cleaning these two features by breaking them down into their parts.
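A sketch of that cleanup, assuming the `season` and `game_date` string formats shown in the csv (toy rows):

```python
import pandas as pd

# Hypothetical rows in the shapes used by the competition csv.
df = pd.DataFrame({
    "season": ["1996-97", "2015-16"],
    "game_date": ["1996-11-03", "2016-04-13"],
})

# Break each feature into its parts.
df["game_date"] = pd.to_datetime(df["game_date"])
df["game_year"] = df["game_date"].dt.year
df["game_month"] = df["game_date"].dt.month
# The season's starting year, e.g. "1996-97" -> 1996.
df["season_start"] = df["season"].str.split("-").str[0].astype(int)
print(df[["game_year", "game_month", "season_start"]])
```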

Now that everything has been disassembled, we'll treat this as a time series, meaning we will look for trends and seasonality. Two questions to keep in mind:

  1. How did Kobe's shot count change depending on the year?
  2. How did his accuracy change, and how did the shot count play a role in it?

This visualization provides clear answers to the questions above:

  1. Towards the beginning and end of his career, he took fewer shots than in the middle of his career.
  2. His accuracy was generally consistent from 1999 to 2013. Before 1999, the variance of per-game accuracy was wild due to the low number of shots taken per game (lots of 100% and 0% there). After 2013, you can see a drop in accuracy, likely caused by injuries sustained during the brutal 2013 season.

Generally, there seems to be no seasonality. Kobe, professionally, did not care much whether or not it was November or March.

In short, this set of features can be boiled down to the predictive power of the season Kobe played in. Let's look closer at it.

It seems that the late 90s and early 2000s were pretty good years for Kobe's accuracy. There's a decent range involved with this feature, but the deviation is not that high. We'll keep the season feature for the model, but the game date and the seasonality components can be dropped.

4.2: What About Playoffs?

Generally, Kobe didn't seem to be affected by playoff pressure. These results would push us towards getting rid of the feature as a predictive variable.

4.3: In-Game Temporal Features

Our next features are all connected in that they describe the points during individual games at which Kobe made or missed a shot. What's interesting about these features is that there are arguments both for and against combining them into one overall clock.

For: Combining them all into one would allow us to track shots continuously throughout the game.

Against: This is not exactly how basketball works; shots are more likely to happen as the timer runs down, especially as the number of periods drags on. A continuous clock would falter when explaining hopeful shots made just before halftime and the like.

Continuous intervals interrupted by quarter starts and stops seems to be the best way forward. I will merge the minutes and seconds features into one but leave the quarter variable alone. In short, I will be treating this also like a time series.
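The merge can be sketched as (column names assumed from the csv):

```python
import pandas as pd

# Hypothetical rows with the two countdown features.
df = pd.DataFrame({
    "minutes_remaining": [11, 0, 0],
    "seconds_remaining": [35, 59, 1],
})

# One countdown per period: seconds left on the quarter clock.
df["period_seconds_remaining"] = (
    df["minutes_remaining"] * 60 + df["seconds_remaining"])
print(df["period_seconds_remaining"].tolist())
```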

There's a clear difference in shot quality by period, primarily between "normal" periods and irregular ones.

How does the minute marker affect performance?

We seem to have reasonably consistent performance across minutes throughout a quarter with one exception. As the last minute runs down, the accuracy suffers.

Remember that many shots were attempted in the last second; these would reasonably have a lower accuracy than their normal counterparts, so we should create a feature that differentiates the two.

4.4: Last-Second Shot

The average scoring rate for shots attempted in the last second is just under 20%. This is a clear drop from the general average, but we are looking at a small subsection of the total population (~600 vs. ~36,000).

What if we check how all shots in the final minute compare?

The average scoring rate for shots attempted in the last minute is significantly higher than in the last second but not as high as the general average. This makes intuitive sense, since the last minute is a middle ground between normal play and a desperate shot at the last second.

It seems that both the last-second and last-minute flags would be beneficial in classifying a shot as a field goal or a miss, but they would likely have a fair degree of covariance. Let's see if it's a problem.
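The check can be sketched on toy rows (the flag definitions here are illustrative, and the real correlation is computed on the full dataset):

```python
import pandas as pd

# Hypothetical countdown values.
df = pd.DataFrame({
    "minutes_remaining": [0, 0, 5, 0, 11],
    "seconds_remaining": [0, 30, 10, 2, 40],
})

# Flag shots in the final minute / final second of a period.
df["last_minute"] = (df["minutes_remaining"] == 0).astype(int)
df["last_second"] = ((df["minutes_remaining"] == 0)
                     & (df["seconds_remaining"] == 0)).astype(int)

# Pearson correlation between the two binary flags.
print(df["last_minute"].corr(df["last_second"]))
```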

With a correlation of 0.35, they have a little bit of covariance but not enough to be considered problematic alone. We'll add both features.

4.5: Game Location

Next up, how does the game's location affect Kobe's shot performance? Does he have a home turf advantage?

Surprisingly, there does not seem to be a difference between Kobe's performance away versus at home.

4.6: Opponent Quality

The quality of the opposing team is also a theoretical factor for the number of shots Kobe would make versus miss.

According to the shot scoring percentage by team, there is not a noticeable difference in performance based on the opponent Kobe faced. If you look at the distribution of field goal percentage on a per-team basis, there is no clear relationship.

We'll remove opponents as a predictive feature.

4.7: Shot Angle

This was a late addition. After flirting with making do with the shot zone area, I decided to calculate the angle at which Kobe looked to make a shot.

This visual may help inspire:

After seeing this, it may come to mind that you can find the shot angle by utilizing three points:

  1. The hoop at the origin (0,0)
  2. The sideline at (250,0)
  3. The point at which the shot was taken at (x, y)

Let's take one observation as an example:

Now that we have our vectors, let's calculate the angle between them. The cosine of the angle equals the dot product of the two vectors divided by the product of their magnitudes.
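A sketch of that calculation, using the hoop-at-origin and sideline point (250, 0) fixed above (coordinates in the dataset's loc_x/loc_y units):

```python
import numpy as np

def shot_angle(x, y):
    """Angle at the hoop between the sideline direction (250, 0)
    and the shot location (x, y), in degrees."""
    v1 = np.array([250.0, 0.0])          # hoop -> sideline
    v2 = np.array([float(x), float(y)])  # hoop -> shot location
    cos_theta = v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

# A shot taken straight in front of the hoop sits at roughly 90 degrees;
# shots on the far side of the court (negative x) exceed 90.
print(shot_angle(0, 150))
```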

It seems that the angle for the above observation was 104.79 degrees. There is a slight problem with the new feature: the relationship is parabolic. You can see this below.

Kobe has the best chance of sinking a shot when he's directly in front of the net. In that position, he can take full advantage of the backboard. As the angle becomes more extreme (in either direction), the benefit decays.

We fixed this by adjusting the feature so that a straight-on shot is 0 and anything else is a higher number: we subtracted 90 from the angle and took the absolute value. Shots from 90 degrees become 0, and those from 0 or 180 degrees become 90. In this way the relationship becomes negatively correlated and roughly linear.
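The adjustment in code (the function name is my own):

```python
def adjust_angle(angle_degrees):
    """Re-center so a straight-on shot (90 degrees) maps to 0 and
    extreme angles toward either sideline map to 90."""
    return abs(angle_degrees - 90)

# The 104.79-degree example above adjusts to roughly 14.79.
print(round(adjust_angle(104.79), 2))
```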

We now have our shot angle feature! As a result, the lower the adjusted angle was, the more likely Kobe was to score.

5: Setting up the Pipeline

5.1: What are our Features? (Summary)

  1. Shot Type: What type of shot style did Kobe use? (layup, dunk, hook shot?)

  2. Period: What point of the game did he shoot?

  3. Shot Zone: At what section of the court was Kobe situated when he shot?

  4. Shot Distance: How far was he from the hoop?

  5. Season: At what stage of his career did he make that shot?

  6. Last Second Check: Was the shot made as the clock ran out?

  7. Last Minute Check: Was it made when the clock was running out?

5.2: Transformation Functions

5.3: Applying the Column Transformer to the Training Set

6: Modeling

6.1: Model Evaluation Function

Now that everything is sorted, we'll begin modeling. Let's start by creating a function that will fit and apply a model of our choice before providing several evaluation metrics to gauge each model's predictive power.

We picked accuracy as a baseline metric due to its comparability. We'll also provide a confusion matrix, the F1-Score (which incorporates both Precision and Recall), and a visualization of the latter.
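A sketch of such a helper, assuming scikit-learn (the function name and toy data are my own):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

def evaluate_model(model, X_train, y_train, X_val, y_val):
    """Fit the model, then report accuracy, the confusion matrix, and F1."""
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    return {
        "accuracy": accuracy_score(y_val, preds),
        "confusion_matrix": confusion_matrix(y_val, preds),
        "f1": f1_score(y_val, preds, zero_division=0),
    }

# Toy usage with a baseline dummy classifier on made-up data.
X = np.arange(60, dtype=float).reshape(20, 3)
y = np.tile([0, 0, 1], 7)[:20]  # imbalanced labels, majority class 0
scores = evaluate_model(DummyClassifier(strategy="most_frequent"),
                        X[:14], y[:14], X[14:], y[14:])
print(scores["accuracy"])
```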

6.2: Dummy Classifier

As expected, the model performed poorly. We will at least be able to compare more complicated models against it.

6.3: Decision Tree Classifier

The decision tree performed markedly better than the baseline, but 60% is not too exciting either. That said, we have more powerful models coming up.

6.4: KNN

6.5: Logistic Regression

Another noticeable improvement. Let's see if we can do better.

6.6: Naive Bayes

6.7: SVM

The SVM's predictive power was comparable to the Logistic Regressor.

6.8: Random Forest Classifier

6.9: Adaboost

6.10: Xgboost

7: Fine-Tuning Promising Models

7.1: KNN

7.2: SVM

7.3: Random Forest

7.4: Adaboost

7.5: Logit Regression

8: Voting Classifier Versus Stacked Classifier

9: Predicting the Test Set

9.1: Classification

9.2: Probabilities

The Kaggle competition requires submissions be in the form of probabilities rather than predictions. Personally, I see the above sections as the real output, but we'll perform these steps as well.